Abstract

First-order logic is the traditional basis for knowledge representation languages. However, its
applicability to many real-world tasks is limited by its inability to represent uncertainty. Bayesian
belief networks, on the other hand, are inadequate for complex KR tasks due to the limited expressivity
of the underlying (propositional) language.

The need to incorporate uncertainty into an expressive language has led to a resurgence of work on
first-order probabilistic logic. This paper addresses one of the main objections to the incorporation of
probabilities into the language: "Where do the numbers come from?" We present an approach that takes
a knowledge base in an expressive rule-based first-order language, and learns the probabilistic
parameters associated with those rules from data cases. Our approach, which is based on algorithms
for learning in traditional Bayesian networks, can handle data cases where many of the relevant aspects
of the situation are unobserved. It is also capable of utilizing a rich variety of data cases, including
instances with varying causal structure, and even involving a varying number of individuals. These
features allow the approach to be used for a wide range of tasks, such as learning genetic propagation
models or learning first-order STRIPS planning operators with uncertain effects.

1 Introduction

First-order logic has traditionally formed the basis for most large-scale knowledge representation
systems. The advantages of first-order logic in this context are obvious: the notions of "individuals",
their properties, and the relations between them provide an elegant and expressive framework for
reasoning about many diverse domains. The use of quantification allows us to compactly represent
general rules that can be applied in many different situations. For example, when reasoning about
genetic transmission of certain properties (e.g., genetically transmitted diseases), we can write
down general rules that hold for all people and many properties. Unfortunately, like all deterministic
logics, first-order logic is highly limited in its ability to represent our uncertainty about the world.
A fact is either known to be true, known to be false, or neither. One cannot say that a fact is probably
true. Real-world relationships, on the other hand, are noisy and non-deterministic, a fact which cannot
be captured in the standard logical framework. This severely limits the applicability of this framework.
For example, very few deterministic rules in the domain of genetically transmitted properties are
actually (absolutely) true in real life.

* This work was supported in part through the generosity of the Powell foundation, and by ONR
grant N00014-96-1-0718.

This limitation, which is crucial in many domains (e.g., medical diagnosis), has led over the last decade
to the resurgence of probabilistic reasoning in AI. In particular, Bayesian belief networks [Pearl, 1988]
have been shown to be a principled and useful framework for reasoning in an uncertain domain.
However, belief networks do not, by themselves, provide a complete solution for very large-scale
knowledge representation tasks. The primary reason is their attribute-based (propositional) nature,
which does not support a domain description in terms of general rules that apply to many qualitatively
different situations.

The tension between these two complementary paradigms has been the primary motivation for some of
the recent work on trying to combine the two [Halpern, 1990; Breese, 1992; Poole, 1993; Ngo et al.,
1995]. Knowledge-based model construction (KBMC) goes a considerable way towards bridging this gap
by allowing a set of first-order probabilistic logic (FOPL) rules (first-order rules with associated
probabilistic uncertainty parameters) to be used as a basis for generating Bayesian networks tailored to
particular problem instances.

The idea of attaching probabilistic parameters to rules leaves unanswered one of the major objections
that have been raised about probabilistic representations: the famous "where do the numbers come from"
question. This issue has been addressed satisfactorily for traditional belief networks [Lauritzen, 1995;
Heckerman, 1995]. In this paper, we show how similar techniques can be used to learn the probabilistic
parameters of FOPL rules from data.

The ability to learn the uncertainty parameters of a rich first-order representation has the potential to
be a powerful tool in many situations. We illustrate this using two very different examples: reasoning
about genetically transmitted properties and planning for mobile robots. In both these examples, as in
many others, the problem of learning these uncertainty parameters is an important one. The skeleton of
the rules (the rules without the parameters) is often easier to acquire than the parameters themselves.
In fact, an existing set of traditional first-order rules often provides us with an appropriate skeleton.
In general, the rule structure directly reflects the underlying causal structure of the domain, a type of
knowledge with which human experts are often fairly comfortable. By contrast, probabilistic parameters
are notoriously difficult to elicit from people.

The propagation of genetically transmitted properties was a key example in some of the early research
into belief networks [Lauritzen and Spiegelhalter, 1988]. Given a particular family tree and set of
properties being studied, one can construct a traditional propositional belief network in which the
probability of each property being passed from each generation to the next is represented. However, a
new network must be specially constructed for every family tree. Currently, this is either done manually
or using a special-purpose procedural program [Szolovits and Pauker, 1992]. Using a first-order
representation, one can capture the general mechanism of gene inheritance using a small number of
rules. These can be used to automatically generate an appropriate belief network for any family tree and
any set of properties. The generality of first-order languages is manifested here in three ways: the same
mechanism is at work in different family trees, in different generations within the same family tree, and
in the propagation of different genes from parent to child. Our approach enables us to learn the
propagation parameters for the different classes of properties. We will also be able to learn the strength
of the correlations between propagations of the different properties, e.g., that eye color is usually
propagated together with diabetes (because of proximity on the chromosome strand).

The task of planning for an autonomous agent has traditionally been based on a logical representation of
actions and their effects. In recent years, there has been a growing consensus that the underlying
assumptions of this representation (deterministic actions, reliable sensors, and, often, complete
observability) are rarely true in practice, particularly in robotics applications. As a consequence,
probabilistic representations have recently started to play a role in planning [Kushmerick et al., 1993;
Dean and Wellman, 1991]. These representations, however, are typically attribute-based, and are
therefore limited in their ability to capture general patterns in action models. FOPL would allow for an
integration of these two formalisms.

Certain important issues arise when we contemplate learning in domains such as these. First, as in most
real-life applications, most of the relevant variables are not observed by the learning agent. In the
domain of genetically transmitted properties, we may observe the phenotype of some of the people in the
family. We will rarely (if ever) have complete information about the entire family. We will hardly ever
have any information at all about the genotype of the different people involved. In the planning domain,
we will typically have access only to the robot's sensors. The variables corresponding to the true state
of the world are almost always unobservable.

A second common thread is that these domains all require the ability to learn from data cases that are
qualitatively very different from each other. For example, in the genetic propagation example, we wish
to learn from different family trees, over a varying number of individuals, and representing the
inheritance of different properties. In our planning domain, we are faced with runs of different length,
where the robot undertakes different actions. Note that, in both of these applications, each of the (very
different) data cases gives us information about many (or all) of the parameters of interest, so that we
cannot simply separate them into distinct clusters and run a learning algorithm on each.

The approach we present in the paper is capable of dealing with both of these issues. We start out with a
knowledge base consisting of partially specified FOPL rules: the rule structure is determined, but the
uncertainty parameters are left unspecified. We are also given a set of data cases, where each data case
consists of a context (e.g., the structure of the family tree, or the set of actions taken by the robot) and
the observations made by the agent. We use a standard KBMC algorithm to generate the network
structure for each of the data cases. The observations in each data case become evidence in the resulting
network. The conditional probability tables in the resulting networks are related to the parameters
corresponding to the rules in the knowledge base. We adaptively learn these parameters, using an
extension to the standard EM algorithm [Lauritzen, 1995] for learning the parameters of a belief
network with fixed structure and hidden variables. We extend it to deal with an ensemble of networks of
varying structure, and in which the same parameter can appear several times.

The two major advantages of first-order languages over propositional languages, generality and
compactness, have particular ramifications in a learning context. The generality of first-order models
allows the learned parameters to be reused again and again in many different contexts. The cost of
learning is therefore amortized over a large number of instances in which the benefits are reaped. The
compactness of such representations allows a probabilistic model to be represented using a small number
of parameters, hopefully resulting in faster learning.

2 Knowledge-based model construction

Since the idea of constructing belief networks from a first-order probabilistic knowledge base was first
proposed [Breese, 1992], several approaches have been developed. Most of these augment
logic-programming-style rules with uncertainty parameters. In this paper, we largely follow the
framework of [Ngo et al., 1995]. In this approach, a set of Horn rules describes the ways in which
first-order predicates influence each other. Because the influence may be uncertain, each rule has an
uncertainty parameter associated with it. Intuitively, one can think of a rule as identifying a possible
set of conditions under which the consequence becomes true, and giving the probability that the
consequence actually becomes true as a result of the conditions. For example, a very simple model for
gene propagation can be expressed in the rules:

genotype(P,G) ←(0.5) parent(P,Q), genotype(Q,G).

phenotype(P,G) ←(0.75) genotype(P,G).

The first rule says that when a person's parent has a gene, the person will inherit it with probability
0.5. The second rule says that when a person has a gene, it will be observed in the person's phenotype
with probability 0.75.

When only one instantiation of a rule can cause a predicate to be true, the associated uncertainty
parameter is in fact the conditional probability that the head is true given that the body is true.
Sometimes, more than one set of conditions can cause a predicate to be true. For example, if both of a
person's parents have a gene, the first rule will fire twice, for the two different values of Q. In such
cases, we need a combination rule to indicate how the different possible causes interact. One very
common combination rule is noisy-or [Pearl, 1988], which describes a situation in which an effect
happens whenever any of its potential causes succeeds in making it happen, and the different causal
influences act independently. More precisely, the probability that the effect does not happen is the
probability that all the potential causes independently fail to cause it. For example, if the combination
rule for genotype is noisy-or, then a person both of whose parents have a gene will fail to inherit it
with probability $(0.5)^2 = 0.25$.

A more accurate model for genetic propagation may incorporate the number of chromosomes (0, 1, or 2)
on which the gene is found. (Our language allows for multi-valued variables. Rules for such variables
have several parameters corresponding to each of the possible values.) A person with the gene on both
chromosomes in a pair will always propagate a copy to his or her children, while a person with one copy
will only propagate it with probability 0.5. The combination rule in this case will be a noisy addition
rule in which the number of genes possessed by a person is the sum of the number of successful
propagations from his or her parents. The probability that a gene will be manifested in a person's
phenotype will depend both on the number of copies possessed and on whether the gene is dominant or
recessive. Note that this can easily be expressed in our language as a property of G. An even richer
model may consider the correlation between the propagation of different genes based on their proximity
on the chromosome strand.
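The noisy-or computation described above can be checked with a short Python sketch. The function name `noisy_or` is chosen here for illustration and is not part of the rule language; the only numbers used are the 0.5 propagation parameter from the first rule.

```python
import math


def noisy_or(cause_probs):
    """Probability that the effect occurs under noisy-or.

    Each entry is the probability that one active cause, acting
    independently, succeeds in producing the effect.
    """
    # The effect fails only if every cause independently fails.
    p_all_fail = math.prod(1.0 - p for p in cause_probs)
    return 1.0 - p_all_fail


# Both parents carry the gene; each propagates it with probability 0.5.
p_inherit = noisy_or([0.5, 0.5])
p_fail = 1.0 - p_inherit  # matches the (0.5)^2 = 0.25 figure in the text
```

With a single carrier parent the same function returns the rule parameter itself, which is the single-instantiation case discussed at the start of the paragraph.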
The rules in the knowledge base describe, in a general manner, the ways in which various predicates
interact. They are used in a particular situation to build a Bayesian network, via a process of
knowledge-based model construction (KBMC). The resulting network defines a probability distribution
over the variables that are relevant in the given situation. A situation is defined by a context, which
determines the structural relationship between the objects in the situation. In the genetic domain, the
parent predicate is part of the context, defining the family tree. In general, the body of a rule will
consist of both context variables and random variables, which are treated differently by the model
construction process.

The KBMC algorithm takes as input a knowledge base, a context, a query and evidence, and returns a
Bayesian network that can be used to compute the probability of the query given the evidence. The
context, the query, and the evidence are all ground facts in the language. The algorithm proceeds by
backward chaining through the Horn rules, iteratively adding nodes representing different ground
facts to the network. It starts by adding the query and the evidence. Each time a variable is added to
the network, it is matched with the rule heads to determine what predicates can influence it. Context
predicates appearing in the rule body must be satisfied for the rule to apply; if it does, the other
predicates in the body (appropriately instantiated) are added to the network as random variables.
Figure 1 shows a simple network constructed for the genetic domain to compute the probability of
phenotype(a,big_ears) given phenotype(b,big_ears) and not phenotype(c,big_ears). The context consists
of the ground facts parent(a,c), parent(b,c), parent(a,d), and parent(b,d).
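The backward-chaining construction can be sketched in Python for the two genetic rules of this section. This is a minimal illustration under simplifying assumptions, not the algorithm of [Ngo et al., 1995]: rule matching is hard-coded in a hypothetical `influences` function, nodes are plain tuples, and CPTs and combination nodes are left out. Here parent(P,Q) is read as "Q is a parent of P", as in the first rule.

```python
from collections import deque

# Context facts from the example: parent(P, Q) means Q is a parent of P.
PARENT = {("a", "c"), ("b", "c"), ("a", "d"), ("b", "d")}


def influences(node, parent_facts):
    """Match a ground fact against the rule heads and return its influences."""
    pred, person, gene = node
    if pred == "phenotype":
        # phenotype(P,G) <- genotype(P,G)
        return [("genotype", person, gene)]
    if pred == "genotype":
        # genotype(P,G) <- parent(P,Q), genotype(Q,G)
        # parent is a context predicate: it must hold, and adds no node itself.
        return [("genotype", q, gene)
                for (p, q) in sorted(parent_facts) if p == person]
    return []


def build_network(query, evidence, parent_facts):
    """Backward-chain from the query and evidence; map each node to its parents."""
    network = {}
    frontier = deque([query, *evidence])
    while frontier:
        node = frontier.popleft()
        if node in network:
            continue  # already expanded
        network[node] = influences(node, parent_facts)
        frontier.extend(network[node])
    return network


net = build_network(
    query=("phenotype", "a", "big_ears"),
    evidence=[("phenotype", "b", "big_ears"), ("phenotype", "c", "big_ears")],
    parent_facts=PARENT,
)
```

On this input the sketch produces seven nodes, the same ground facts that appear in Figure 1; phenotype(d,big_ears) is absent because it is neither queried nor observed.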

[Figure 1: Generated network for genetic domain. Nodes: genotype(c,big_ears), genotype(d,big_ears),
genotype(a,big_ears), genotype(b,big_ears), phenotype(a,big_ears), phenotype(b,big_ears),
phenotype(c,big_ears).]

In order to complete the specification of the probability distribution defined by the Bayesian network,
the KBMC algorithm must determine the conditional probability table (CPT) for each node. This table
lists the conditional probability of the node given each possible value of its parents. The CPT entries
are determined by the uncertainty parameters, using the combination rules. In principle, the combination
rules could determine the entries to be any function of the parameters. However, learning is greatly
simplified if each CPT entry is associated with at most one parameter which must be learned. Thus, we
restrict attention to decomposable combination rules, ones which can be expressed using a set of
separate nodes corresponding to the different influences, which are then combined in another node.
Fortunately, all the commonly occurring combination rules (including noisy-or and tree-structured
[Boutilier et al., 1996]) generate CPTs with this property. The KBMC algorithm automatically generates
the decomposed representation for these combination rules, thereby facilitating learning.

This approach can be applied naturally to planning domains. For example, consider planning in a
robotics domain where properties of objects and the effects of actions such as moving and grasping are
uncertain. For any (possibly uncertain) initial condition and sequence of actions, KBMC is used to build
a probabilistic model of the world after the actions have been taken. The context here consists of the
set of objects in the world, some known properties of the objects, such as their type and shape, and the
sequence of actions taken. The random variables include other properties of the objects (such as
whether they are currently wet and therefore harder to grab), and the locations of objects at different
times. The knowledge base consists of STRIPS-like rules with uncertainty (as in [Kushmerick et al.,
1993]), stipulating the probability that certain postconditions will hold given that the preconditions
hold and an action is taken. Our use of the closed world assumption on context predicates fits naturally
with the STRIPS assumptions.

A useful type of combination rule for planning domains is a selection rule. A selection rule behaves
analogously to a multiplexer: a predicate may match several rules, and the value of a selection variable
determines which of the rules is applicable. Selection rules are useful in situations where a predicate
is influenced by a single cause, but the identity of the cause is itself uncertain. This is typical of
planning situations. For example, the effects of a move action may depend on the properties of the
robot's current location. Since the robot's location is itself a random variable, the properties of all
locations could potentially influence the action's effects. The location is used as a selector to
determine which properties are the relevant ones. Computationally, selection rules require the Bayesian
network inference algorithm to take advantage of context-specific independence [Boutilier et al., 1996].
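The multiplexer behavior of a selection rule can be sketched in Python as follows. Everything concrete here is hypothetical and chosen only for illustration: the two locations, their floor types, and the success probabilities do not come from the paper, and the location properties are treated as known rather than as random variables.

```python
def move_success_probability(location_dist, floor_at, success_given_floor):
    """Marginal probability that a move succeeds under a selection rule.

    The robot's (uncertain) location acts as the selector: for each
    possible location, only that location's property is relevant.
    """
    return sum(
        p_loc * success_given_floor[floor_at[loc]]
        for loc, p_loc in location_dist.items()
    )


# Hypothetical numbers, for illustration only.
location_dist = {"hall": 0.7, "lab": 0.3}          # belief over the selector
floor_at = {"hall": "carpet", "lab": "tile"}       # known property per location
success_given_floor = {"carpet": 0.9, "tile": 0.6}  # one rule parameter per case

p_success = move_success_probability(location_dist, floor_at, success_given_floor)
```

Given a fixed value of the selector, the effect depends on a single location's property, which is the context-specific independence that inference can exploit.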

3 Learning

Our learning task is to take a set of data cases, $\mathcal{C}$, and return a hypothesis $H$ that
"explains" the data $\mathcal{C}$ in the best possible way. The hope is that a hypothesis that provides
a good explanation will also generalize well to unseen data cases (modulo concerns about overfitting).
Here, each data case consists of a context and some evidence. Since our goal is to learn the parameters
for a set of FOPL rules, we assume that our algorithm is provided with a skeleton rule base, and must
only fill in the values of the uncertainty parameters. Formally, our hypothesis space consists of the
possible values for the rule parameters. We assume for convenience that the rule parameters have values
between 0 and 1. Thus, if there are $M$ unspecified parameters, the hypothesis space is the set of
$M$-vectors $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_M) \in [0,1]^M$. We consider a hypothesis
$\boldsymbol{\theta}$ as a good explanation for the data if it gives it high probability. Thus, we seek
to find the maximum likelihood hypothesis $\boldsymbol{\theta}$ that maximizes the probability of the
data $\mathcal{C}$.

Our first task is to define this probability $\Pr_{\boldsymbol{\theta}}(\mathcal{C})$. For a single data
case $C \in \mathcal{C}$, we can use the techniques of Section 2. A probabilistic model consisting of the
rule skeletons and the parameters $\boldsymbol{\theta}$ defines a belief network for each data case.
The probability of the data case given the hypothesis is thus defined as the probability that the
evidence variables will take on their given values in the distribution defined by this belief network.
More precisely, let $C$ be a particular data case, $D_C$ the observed evidence in the data case, and
$N_C^{\boldsymbol{\theta}}$ the belief network constructed for that data case from its context. The
probability of $C$ given the hypothesis $\boldsymbol{\theta}$ is defined as

$$\Pr_{\boldsymbol{\theta}}(C) \stackrel{\mathrm{def}}{=} \Pr_{N_C^{\boldsymbol{\theta}}}(D_C).$$

At this point we would like to define the likelihood of the entire data set as the product of the
likelihoods of the individual data cases, so that if $\mathcal{C} = \{C_1, \ldots, C_N\}$, then

$$\Pr_{\boldsymbol{\theta}}(\mathcal{C}) = \prod_{i=1}^{N} \Pr_{\boldsymbol{\theta}}(C_i).$$

This definition embodies the assumption that the different data cases are independent. This seemingly
innocuous assumption, which is made almost universally in the context of machine learning, is not as
obviously justified here. Our more expressive language gives us the ability to relate two individuals,
having the properties of one affect the other. But now, we may be uncertain about whether individuals
observed in the context of two different data cases are related to each other. In the gene inheritance
domain, for example, two people appearing in different family trees may in fact have a common ancestor,
thereby linking the two trees.

In some sense, the "ideal" model for a data set $\mathcal{C}$ is a single huge belief network
incorporating all of our information. After all, our domain really does contain all of these elements.
In this network, we can represent our uncertainty concerning the potential relationships between
individuals in the different family trees. Clearly, practical considerations prevent us from taking this
course. This is the reason for making the independence assumption, which in the genetics example
essentially asserts that different family trees are extremely unlikely to be closely related to each
other, and that the influences between data cases are attenuated across many generations. However, it is
important to keep in mind that this is purely an approximation, and that we need to check every time
whether it is justified in our particular situation.
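Under the independence assumption, the data-set likelihood is a product of per-case likelihoods. The Python sketch below illustrates one practical consequence that the paper does not discuss and that is offered here only as an implementation note: the product of many small probabilities underflows in floating point, so it is usual to maximize the sum of logarithms instead. The per-case probabilities are hypothetical placeholders for the values $\Pr(D_C)$ that network inference would return.

```python
import math


def log_likelihood(case_probs):
    """Log of the product of per-case evidence probabilities."""
    return math.fsum(math.log(p) for p in case_probs)


# Hypothetical per-case probabilities Pr(D_C), one per data case.
case_probs = [1e-5] * 1000

naive_product = math.prod(case_probs)  # underflows to 0.0 in double precision
total_log = log_likelihood(case_probs)  # stays finite and comparable
```

Because the logarithm is monotone, the parameter vector that maximizes the log-likelihood also maximizes the product.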

4 The Learning Algorithm

In this section we describe our learning algorithm in detail. The algorithm takes as input a set of
probabilistic rule skeletons with some of the rule parameters left unspecified, and a training set
$\mathcal{C}$ consisting of contexts and evidence. It attempts to find the maximum likelihood vector of
parameter values using a two-stage process. In the first stage, it constructs a belief network for each
data case by mimicking the knowledge-based model construction process. In the second stage, it attempts
to find the maximum likelihood hypothesis using the EM method (in a manner analogous to its use for
learning Bayesian networks [Lauritzen, 1995]).

Let $C$ be a particular data case, $D_C$ the evidence in the data case, and $N_C$ the belief network
constructed for that data case. The learning algorithm begins by building the network for each data
case, via back-chaining from the evidence nodes.
The second phase of the algorithm uses the EM algorithm to search for the value of
$\boldsymbol{\theta}$ that maximizes the likelihood of the evidence in the constructed belief networks.
We briefly review the EM algorithm and its application to our problem.

To understand the intuition, consider the problem of maximum likelihood parameter learning in standard
Bayesian networks from fully observable data. There, the network structure is identical in all data
cases, and the parameters are simply the CPT entries. Let $X$ be some node in the network and $U$ be
its parents. The maximum likelihood estimate for the CPT entry $\Pr(X = x \mid U = u)$ is simply the
number of data cases where $X, U$ take the values $x, u$ respectively, divided by the number of data
cases where $U$ takes the value $u$.

If our data cases have missing values, we can no longer perform this counting process. The EM algorithm
essentially provides us with a way for probabilistically "filling in" the missing values. It starts out
with some initial set of parameters, and uses them to compute a probability distribution over the
various possible completions of each partial data case. Each completion is then treated as a
fully-observed data case, but one whose weight is its probability. A new set of parameters is then
computed as described above, over the set of weighted data cases. The process is now repeated with the
new set of parameters. Standard results (see [McLachlan and Krishnan, 1997]) imply that this procedure
converges to a set of parameters which is a local maximum in the likelihood space. In practice, of
course, one cannot generate every fully observable completion for a partially observable data case $C$,
since the number of such completions is exponential in the number of unobserved variables in $C$.
Luckily, the total weight of the completions for $C$ which contribute to the weighted count of the event
$X = x, U = u$ is simply $\Pr(X = x, U = u \mid D_C)$.

In our context, the basic idea is the same. The two main differences are that the networks for the
different data cases have different structures, so that a parameter may appear in a variety of contexts,
and that the same parameter can appear more than once in the same network. To see that neither of these
is a problem, consider some rule $r$ in our knowledge base. By assumption, $r$ is associated with some
set of parameters, and these appear only in $r$. Recall that our use of decomposable combination rules
implies that each node in the generated network is associated with at most one rule. (Some nodes simply
compute deterministic functions such as or and summation.) Thus, while the same rule can induce more
than one node in the network for a data case, all of the nodes have an identical local structure: the
CPTs are the same and incorporate the parameters in the same way. Thus, each of the nodes derived from
$r$ can be viewed as a separate "experiment" for the parameters associated with $r$. Each should
therefore make a separate contribution to the "count" for those parameters. This situation is now
analogous to the one that arises in learning Hidden Markov Models [Rabiner and Juang, 1986], where we
also have data cases of varying structure and parameter sharing in each data case.

Formally, let $r$ be some rule, and $\theta$ be one of its parameters. Let $x_r, u_r$ be the values for
a node and its parents that are associated with $\theta$ in a node generated by rule $r$. For every data
case $C$, let $\mathcal{X}_r^C$ be the set of nodes in the network $N_C$ which correspond to the rule
$r$. In each iteration of the EM algorithm, we begin with some set of parameters $\boldsymbol{\theta}$,
and adjust each of them according to the weighted counts, as follows:

$$\text{new-}\theta \leftarrow
\frac{\sum_{C} \sum_{X \in \mathcal{X}_r^C} \Pr_{\boldsymbol{\theta}}(X = x_r, \mathrm{Parents}(X) = u_r \mid D_C)}
     {\sum_{C} \sum_{X \in \mathcal{X}_r^C} \Pr_{\boldsymbol{\theta}}(\mathrm{Parents}(X) = u_r \mid D_C)}$$

The values of $\Pr_{\boldsymbol{\theta}}$ can be computed using standard Bayesian network calculations
in the network $N_C$, constructed using the current guess for $\boldsymbol{\theta}$. The fact that we
get the maximum likelihood estimate for our parameters in the fully observable case implies the desired
convergence property [McLachlan and Krishnan, 1997]:

Theorem 1: The iterative EM procedure described above converges to a set of parameters
$\boldsymbol{\theta}$ which induces a local maximum in the likelihood
$\Pr_{\boldsymbol{\theta}}(\mathcal{C})$.
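The weighted-count update can be illustrated with a Python sketch on a deliberately tiny stand-in model. The assumptions are ours and are not the experimental setup of this paper: individuals are unrelated, each carries the gene with a known prior `Q`, the phenotype appears with unknown probability `theta` when the gene is present and never otherwise, and the posteriors are therefore available in closed form instead of through general network inference. Data cases are lists of varying length whose entries are True, False, or None (unobserved), so the single shared parameter appears several times per case and across cases of different size.

```python
Q = 0.5  # assumed known prior that an individual carries the gene


def expected_counts(obs, theta):
    """Return (Pr(pheno=1, geno=1 | obs), Pr(geno=1 | obs)) for one node."""
    if obs is True:
        # Phenotype observed, so the gene must be present in this model.
        return 1.0, 1.0
    if obs is False:
        # Gene may still be present but unexpressed.
        carrier = Q * (1.0 - theta)
        return 0.0, carrier / (carrier + (1.0 - Q))
    # Unobserved phenotype: the evidence says nothing about this node.
    return Q * theta, Q


def em(cases, theta=0.3, iterations=200):
    """Iterate the weighted-count update for the single shared parameter."""
    for _ in range(iterations):
        numerator = denominator = 0.0
        for case in cases:          # sum over data cases C
            for obs in case:        # sum over nodes generated by the rule
                joint, parent = expected_counts(obs, theta)
                numerator += joint
                denominator += parent
        theta = numerator / denominator
    return theta


# Three data cases of different sizes: 3 positive, 5 negative, 4 unobserved.
cases = [
    [True, False, None],
    [True, False, False, None, None],
    [True, False, False, None],
]
theta_hat = em(cases)
```

In this stand-in model the observed-data likelihood is maximized where `Q * theta` equals the fraction of positive observations (3 of 8), i.e. at `theta = 0.75`, and the iteration approaches that value. Unobserved nodes add equal expected mass to both counts, so they slow convergence without moving the fixed point.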
It is instructive to compare the complexity of our learning algorithm to that of parameter estimation in
standard (propositional) Bayesian networks. Our learning procedure involves a single initial phase of
constructing the network structure for each data case. The cost of this phase is insignificant relative
to the cost of the EM iterations, each of which involves running Bayesian network inference for each
data case. This cost (per iteration) is the same as in the case of learning parameters for propositional
Bayesian networks. We do not have an analysis of the number of iterations required for convergence
either for standard Bayesian networks or for our rules. The first-order rules, however, support a
significant reduction in the dimensionality of the parameter space via the parameter sharing encoded in
the rules. In general, a reduction in the number of parameters tends to speed up convergence. For
example, it has recently been shown [Friedman and Goldszmidt, 1996] that exploiting context-specific
independence in traditional Bayesian networks can speed up the learning process considerably.

The learning procedure presented here suffers from two potential problems: local maxima and
overfitting. Since we are only attempting to learn numerical parameters, any overfitting would be
numerical, i.e., learning the parameter values to too great a degree of accuracy. Techniques such as
random restart may alleviate the problem of local maxima, but possibly at the cost of increasing the
danger of overfitting. Future work should determine how serious these issues are for this procedure, and
develop techniques to deal with them.

5 Experimental Results

We tested the learning algorithm on a simple gene propagation model with three parameters. We generated
data cases from a given set of rules with associated parameters. We then gave our algorithm the
"correct" rule structures and used it to learn the parameters from the data cases. In addition to the
two rules shown in Section 2, there was a rule for spontaneous acquisition of a gene, with uncertainty
parameter 0.05. The experiments tested the ability of the algorithm to learn the correct values of the
parameters from n data cases, for various values of n between 10 and 1000. Each data case described a
family tree relating between 20 and 40 people, with the phenotype being observed for approximately one
third of them. Ten sets of experiments were run for every value of n, each with a different training set
constructed from the same set of parameters.

The results are shown in Figure 2. Figure 2(a) shows the mean absolute error of the learned parameter
values as compared to their true values. The graph shows the average, best and worst results for each
value of n. Figure 2(b) shows the relative error for the parameter values. Figure 2(c) describes the
performance of the learned parameters in predicting the probabilities of events in a test set. The test
set consisted of 500 data cases, generated from the same model as the training data, but which were not
shown to the learning algorithm. The figure shows the mean relative error of the predicted probabilities
as compared to the true probabilities. Notice that the relative error of the predictions is much smaller
than the relative error of the parameter values. (Note the scale of the two graphs.) It has often been
observed that the predictive performance of a Bayesian network is not sensitive to small error in the
parameters. Our results indicate that a similar phenomenon may hold for the parameters of noisy rules.
This type of robustness to small errors greatly increases the applicability of learning.

[Figure 2: (a) mean absolute error of the learned parameters, (b) relative error of the learned
parameters, (c) mean relative error of the predicted probabilities; horizontal axis: number of data
cases n, 0 to 1000.]

6 Conclusion

We have shown how techniques for learning in standard belief networks can be adapted to learning in
first-order probabilistic models. This allows us to learn in domains where we encounter many
qualitatively different circumstances that share an underlying causal structure and uncertainty
parameters. Clearly, more extensive experiments are required in order to test the usefulness of our
approach in practice.

Our presentation in this paper was based on a specific representation language for the first-order
probabilistic rules. We are currently investigating the problem of defining expressive languages for
modeling complex stochastic domains, including languages that support the representation of continuous
variables and temporal processes, and reasoning at different levels of granularity. Whatever the results
of this research, we expect that a model in our language will continue to define a belief network for a
given situation, and that the conditional probabilities will be functions of various parameters.
Therefore, the ideas in this paper should continue to be applicable.

Finally, we have focused on learning the numeric uncertainty parameters of first-order probabilistic
rules. We did not address the problem of learning the structure of the rules. In recent years, there has
been significant work both on learning the structure of belief networks (see [Heckerman, 1995] for a
survey) and on inductive logic programming [Muggleton, 1992], that is, learning deterministic
first-order rules. It would be very interesting to see whether the techniques developed in these two
areas of research can be integrated, allowing us to learn the causal/rule structure of complex uncertain
domains.